A Matrix Representation Of The Inflectional Forms Of Arabic Words: A Study Of Co-Occurrence Patterns
Abstract
A proposed "Matrix" method for the representation of the inflectional paradigms of Arabic words is presented. This representation results in a classification of Arabic words into a tree structure (Fig(1)) whose leaves represent unique conjugational or derivational paradigms, each represented in the proposed "Matrix" form. A study of about 2,500 stems from a high frequency Arabic word list due to Landau has revealed a systematic set of co-occurrence patterns for the enclitic pronouns of Arabic verbs and for the possessive pronouns attached to Arabic nouns. Each co-occurrence pattern represents a subcategorization frame that reflects the underlying semantic relationship. The key feature that distinguishes these semantic patterns has been observed to be whether the attached suffixes relate to the animate or inanimate. In some cases for verbs, the number of the subject is also a significant feature. These semantic features also extend to nonattached subjects and objects (for verbs) and to possessive noun complements (for nouns). Therefore the semantic classes presented in this paper also assist in syntactic/semantic analysis. The first application that was developed, based upon the proposed representation, is a stem-based Arabic morphological analyser, from which a spell checker (on a PS/2 microcomputer) emerged as a by-product. Currently, the system is being used to interact with an Arabic syntactic parser and there are plans to use it in a machine assisted translation system.

1. INTRODUCTION

Over the past few years there has been a marked increase in the use of computers in the Arabic speaking countries. Many applications programs in Arabic have been developed, but the field of computational linguistics is relatively new in Arabic and presents a unique challenge, due to the highly inflected nature of the Arabic language. In the present work, we have attempted to represent the morphological rules governing the inflections of Arabic words in a compact form which can simplify the processing of Arabic words by computers and which is independent of a particular application. There have been other attempts to show the conjugations of Arabic verbs <2> but the treatment does not delve into sufficient depth and not all enclitics, which are an essential part of Arabic verbs, are considered. Moreover, the treatment in <2> does not extend to nouns. By studying some 2,500 stems out of a high frequency Arabic word list due to Landau <1>, certain systematic cooccurrence patterns governing verb enclitics and noun possessive pronouns have been observed. These patterns are what we call "Matrices" in this paper; each unique "Matrix" reflects a different semantic behaviour.

To summarize Arabic morphology in a nutshell, about 80% of Arabic words can be derived from a sequence of three letters, called a triliteral root. For example, if we consider the root كتب (K T B), we can form words such as أكتب (?aKTuB, "I write") and كتاب (KiTa:B, "book"), by subjecting the root to various "forms" or "moulds" and by undergoing certain morphophonemic (and possibly also morphographemic) changes. For a full discussion of traditional Arabic morphology see <9> and <10>.

In this paper, we shall define such an inflected form to be a "STEM". Thus a stem may contain infixes and certain prefixes which are part of the "mould" but may not contain any suffixes. Suffixes for verbs are subject and object pronouns, while for nouns they are possessive pronouns. One further definition which is used in the proposed representation is the "Core"; this is simply the inflected form with all prefixes and suffixes stripped off. The core may or may not be a valid word. In comparison with other work in the area of traditional Arabic morphology (<3>,<4>), where the concern is with the rules which cause the inflected form to be derived from the ROOT, we have studied the rules governing the derivation of all possible inflected forms from the STEM, as defined above.

2. THE MATRIX REPRESENTATION

Sample "MATRIX PARADIGMS" are shown in Fig(2) for verbs and Fig(3) for nouns. Table(1) gives the keys in English to the columns on the Matrix Paradigms. The inflected form for a given Person/Number/Gender/Mode combination for verbs (obtained from the relevant "row" of the Matrix Paradigm) is constructed by concatenating the prefix, core and both subject and object pronoun column entries. The inflected forms for nouns are similarly constructed for a particular Number/Gender/Case combination. The various "cells" of the object pronoun columns indicate whether a particular entry is valid (indicated by "١", an Arabic numeral one). Invalid entries are indicated by "٠", an Arabic zero.
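The row-wise construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Row` class, the `inflected_forms` function, the three transliterated object pronouns and the validity cells are all hypothetical values chosen for illustration; they are not taken from Fig(2) or Table(1). The sketch also assumes that the form with no object pronoun attached is always valid.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical object-pronoun columns (transliterated); not the paper's Table(1).
OBJECT_PRONOUNS = ["ni:", "hu", "ha:"]


@dataclass
class Row:
    """One row of a sketched Matrix Paradigm (one Person/Number/Gender/Mode combination)."""
    label: str
    prefix: str
    core: str
    subject: str
    valid: List[int]  # one 0/1 cell per object-pronoun column


def inflected_forms(row: Row) -> List[str]:
    """Concatenate prefix + core + subject pronoun, then add each object pronoun whose cell is 1."""
    base = row.prefix + row.core + row.subject
    forms = [base]  # assumption: the form without an object pronoun is always valid
    for pronoun, cell in zip(OBJECT_PRONOUNS, row.valid):
        if cell == 1:
            forms.append(base + pronoun)
    return forms


# Illustrative row only: the affixes and validity cells are assumptions.
row = Row(label="1st person singular", prefix="?a", core="KTuB",
          subject="", valid=[0, 1, 1])
print(inflected_forms(row))  # ['?aKTuB', '?aKTuBhu', '?aKTuBha:']
```

Storing validity as one 0/1 cell per pronoun column mirrors the ones-and-zeros layout of the paradigm, so a second row with a different pattern of cells would represent a different co-occurrence pattern without any change to the concatenation logic.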
It is due to this matrix of ones and zeros that the representation was named the "Matrix Paradigm".

3. TAXONOMY OF ARABIC WORDS

Fig(1) shows a tree diagram representing the taxonomical classification of Arabic verbs and nouns. The different "levels" in the tree correspond to different types of variation of the inflected form from one class to another. The first type of variation coincides more or less with the traditional classification and is represented at levels 2 and 3 for verbs and at level 2 for nouns. Each Matrix Paradigm also reflects two further types of variation, which can be considered separately from one another. The first is the variation in the core with the different rows; this dimension corresponds, for example, to the traditional study of verb conjugations (see <2>).